Experiments in Cross-Language Morphological Annotation Transfer
نویسندگان
چکیده
Annotated corpora are valuable resources for NLP which are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and a simple textbook which describes the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian. However, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger ([3]), a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found out that even breathtakingly naive things (such as approximating the Russian transitions by Czech and/or Polish and approximating the Russian emissions by (manually/automatically derived) Czech cognates) can lead to a significant improvement of the tagger’s performance.
منابع مشابه
O. Scrivner, T. Gilmanov SWIFT ALIGNER: A TOOL FOR THE VISUALIZATION AND CORRECTION OF WORD ALIGNMENT AND FOR CROSS LANGUAGE TRANSFER
It is well known that parallel corpora are valuable linguistic resources. One of the benefits of such corpora is that they allow for the building an annotated corpus for resource-poor languages via crosslanguage transfer. That is, given accurate alignment between a word from a source language and its equivalent in a target language, some linguistic information, such as part-of-speech tags or sy...
متن کاملCross-language transfer of semantic annotation via targeted crowdsourcing: task design and evaluation
The development of a natural language speech application requires the process of semantic annotation. Moreover multilingual porting of speech applications increases the cost and complexity of the annotation task. In this paper we address the problem of transferring the semantic annotation of the source language corpus to a low-resource target language via crowdsourcing. The current crowdsourcin...
متن کاملA Systematic Evaluation of Concept-based Cross-Lingual Information Retrieval in the Medical Domain
The paper describes experiments and results of the MuchMore project1, which is concerned with a systematic comparison of concept-based and corpus-based methods in cross-language information retrieval (CLIR) in the medical domain. Primary goals of the project are to develop and evaluate methods for the effective use of multilingual thesauri in the semantic annotation of English and German medica...
متن کاملThree Issues in Cross-Language Frame Information Transfer
In this paper we address the task of transferring FrameNet annotations from an English corpus to an aligned Italian corpus. Experiments were carried out on an English-Italian bitext extracted from the Europarl corpus and on a set of selected sentences from the English FrameNet corpus that have been manually translated into Italian. Our research activity is aimed at answering the following three...
متن کاملSemantic annotation for concept-based cross-language medical information retrieval
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information. Linguistic processing includes part-of-speech ...
متن کامل